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A DIRECT FORMULATION FOR SPARSE PCA USING 
SEMIDEFINITE PROGRAMMING* 
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Abstract. Given a covariance matrix, we consider the problem of maximizing the variance 
VQ ' explained by a particular linear combination of the input variables while constraining the number 

^"^ ^ of nonzero coefficients in this combination. This problem arises in the decomposition of a covari- 

ance matrix into sparse factors or sparse PCA, and has wide applications ranging from biology to 
finance. We use a modification of the classical variational representation of the largest eigenvalue of 
a symmetric matrix, where cardinality is constrained, and derive a semidefinite programming based 
^^' relaxation for our problem. We also discuss Nesterov's smooth minimization technique applied to the 

C^ ' semidefinite program arising in the semidefinite relaxation of the sparse PCA problem. The method 

has complexity 0(n''\/log(n)/e), where n is the size of the underlying covariance matrix, and e is 
the desired absolute accuracy on the optimal value of the problem. 

o. 

^sj ^ Key words. Principal component analysis, Karhunen-Loeve transform, factor analysis, semidef- 

inite relaxation, Moreau-Yosida regularization, semidefinite programming. 

PJ ' AMS subject classifications. 90C27, 62H25, 90C22. 

C/3 i 1. Introduction. Principal component analysis (PCA) is a popular tool for data 

' analysis, data compression and data visualization. It has applications throughout 

science and engineering. In essence, PCA finds linear combinations of the variables 

(^ I (the so-called principal components) that correspond to directions of maximal variance 

in the data. It can be performed via a singular value decomposition (SVD) of the 

*vj \ data matrix A, or via an eigenvalue decomposition if A is a covariance matrix. 

f^ ' The importance of PCA is due to several factors. First, by capturing directions 

>0 \ of maximum variance in the data, the principal components offer a way to compress 

the data with minimum information loss. Second, the principal components are un- 
correlated, which can aid with interpretation or subsequent statistical analysis. On 
the other hand, PCA has a number of well-documented disadvantages as well. A par- 
ticular disadvantage that is our focus here is the fact that the principal components 
are usually linear combinations of all variables. That is, all weights in the linear com- 
bination (known as loadings) are typically non-zero. In many applications, however, 
the coordinate axes have a physical interpretation; in biology for example, each axis 
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H \ might correspond to a specific gene. In these cases, the interpretation of the princi- 

pal components would be facilitated if these components involved very few non-zero 
loadings (coordinates). Moreover, in certain applications, e.g., financial asset trading 
strategies based on principal component techniques, the sparsity of the loadings has 
important consequences, since fewer non-zero loadings imply fewer transaction costs. 
It would thus be of interest to discover sparse principal components, i.e., sets of 
sparse vectors spanning a low-dimensional space that explain most of the variance 
present in the data. To achieve this, it is necessary to sacrifice some of the explained 
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variance and the orthogonality of the principal components, albeit hopefully not too 
much. 

Rotation techniques are often used to improve interpretation of the standard 
principal components (see ^U] for example). Vines or Kolda and O'Leary 271 112| 
considered simple principal components by restricting the loadings to take values from 
a small set of allowable integers, such as 0, 1, and —1. Cadima and JoUiffe 1* proposed 
an ad hoc way to deal with the problem, where the loadings with small absolute 
value are thresholded to zero. We will call this approach "simple thresholding." 
Later, algorithms such as SCoTLASS 11 and SLRA "55", "5^ were introduced to find 
modified principal components with possible zero loadings. Finally, Zou, Hastie and 
Tibshirani [211 propose a new approach called sparse PCA (SPCA) to find modified 
components with zero loading in very large problems, by writing PCA as a regression- 
type optimization problem. This allows the application of LASSO j24| . a penalization 
technique based on the li norm. All these methods are either significantly suboptimal 
(thresholding) or nonconvex (SCoTLASS, SLRA, SPCA). 

In this paper, we propose a direct approach (called DSPCA in what follows) that 
improves the sparsity of the principal components by directly incorporating a sparsity 
criterion in the PCA problem formulation, then forming a convex relaxation of the 
problem. This convex relaxation turns out to be a semidefinite program. 

Semidefinite programs (SDP) can be solved in polynomial time via general-purpose 
interior-point methods I23[I25| . This suffices for an initial empirical study of the prop- 
erties of DSPCA and for comparison to the algorithms discussed above on small prob- 
lems. For high-dimensional problems, however, the general-purpose methods are not 
viable and it is necessary to exploit problem structure. Our particular problem can be 
expressed as a saddle-point problem which is well suited to recent algorithms based 
on a smoothing argument combined with an optimal first-order smooth minimization 
algorithm |21[ 1171 12] ■ These algorithms offer a significant reduction in computational 
time compared to generic interior-point SDP solvers. This also represents a change in 
the granularity of the solver, requiring a larger number of significantly cheaper iter- 
ations. In many practical problems this is a desirable tradeoff; interior-point solvers 
often run out of memory in the first iteration due to the necessity of forming and 
solving large linear systems. The lower per-iteration memory requirements of the 
first-order algorithm described here means that considerably larger problems can be 
solved, albeit more slowly. 

This paper is organized as follows. In section [21 we show how to efficiently maxi- 
mize the variance of a projection while constraining the cardinality (number of nonzero 
coefficients) of the vector defining the projection. We achieve this via a semidefinite 
relaxation. We briefly explain how to generalize this approach to non-square matrices 
and formulate a robustness interpretation of our technique. We also show how this 
interpretation can be used in the decomposition of a matrix into sparse factors. Sec- 
tion [Sj describes how Nesterov's smoothing technique (see [22, [201) can be used to 
solve large problem instances efficiently. Finally, section [HI presents applications and 
numerical experiments comparing our method with existing techniques. 

Notation. In this paper, S" is the set of symmetric matrices of size n, and A„ 
the spectahedron (set of positive semidefinite matrices with unit trace). We denote 
by 1 a vector of ones, while Card(a:) denotes the cardinality (number of non-zero 
elements) of a vector x and Card(Ar) is the number of non-zero coefficients in the 
matrix X. For X G S", AT )^ means that X is positive semidefinite, ||A'||f is the 
Frobenius norm of X, i.e., ||Ar||i? = y/Tr{X'^), X'^^^^(X) is the maximum eigenvalue 
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of X and ||^||oo — '^^^^{i,j=i....,n} \^ij\^ while \X\ is the matrix whose elements are 
the absolute values of the elements of X. Finally, for matrices X,Y ^ S", X oY is 
the Hadamard (componentwise or Schur) product of X and Y . 

2. Semidefinite Relaxation. In this section, we derive a semidefinite program- 
ming (SDP) relaxation for the problem of maximizing the variance explained by a 
vector while constraining its cardinality. We formulate this as a variational problem, 
then obtain a lower bound on its optimal value via an SDP relaxation (we refer the 
reader to j26j or j^ for an overview of semidefinite programming). 

2.1. Sparse Variance MeLximization. Let A e S" be a given symmetric ma- 
trix and k be an integer with 1 < k < n. Given the matrix A and assuming without 
loss of generality that A is a covariance matrix (i.e. A is positive semidefinite), we 
consider the problem of maximizing the variance of a vector x £ R" while constraining 
its cardinality: 

maximize x^Ax 
(2.1) subject to |lx||2 = l 

Card(a;) < k. 

Because of the cardinality constraint, this problem is hard (in fact, NP-hard) and we 
look here for a convex, hence efficient, relaxation. 

2.2. Semidefinite Relaxation. Following the lifting procedure for semidefinite 
relaxation described in ^21j Qi d] for example, we rewrite (|2.1I) as: 



(2.2) 



maximize Tr{AX) 
subject to Tr(X) == 1 



Card(X) < k"^ 

X hO, Rank(X) = 1, 



in the (matrix) variable X € S". Programs (|2.HI and H2.2|l are equivalent, indeed 
if X is a solution to the above problem, then X y and Rank(X) = 1 mean that 
we have X = xx'^ , while Tr(X) = 1 implies that ||a:;||2 = 1. Finally, ii X — xx'^ 
then Card(X) < k^ is equivalent to Card(a;) < k. We have made some progress 
by turning the convex maximization objective x'^ Ax and the nonconvex constraint 
||a;||2 = 1 into a linear constraint and linear objective. Problem 1)2. 2|l is, however, still 
nonconvex and we need to relax both the rank and cardinality constraints. 

Since for every u G R", Card(u) = q imphcs ||u||i < ^||u||2, we can replace the 
nonconvex constraint Card(X) < k^, by aweakerbut convex constraint: 1"^|X|1 < fc, 
where we exploit the property that ||^||_f = ^ x^x = 1 when X — xx^ and Tr(X) = 
1. If we drop the rank constraint, we can form a relaxation of 1)2. 2|l and 1)2. l|l as: 



(2.3) ^r 



maximize Tr{AX) 
subject to Tr(X) = 1 

1^|X|1<A; 

XhO, 



which is a semidefinite program in the variable X G S", where k is an integer param- 
eter controlling the sparsity of the solution. The optimal value of this program will 
be an upper bound on the optimal value of the variational problem in 1)2.1(1 . Here, the 
relaxation of Card{X) in 1-^|X|1 corresponds to a classic technique which replaces 
the (non-convex) cardinality or Iq norm of a vector x with its largest convex lower 
bound on the unit box: |a;|, the ^i norm of a: (see or |H] for other applications). 
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2.3. Extension to the Non-Square Case. Similar reasoning applies to the 
case of a non-square matrix A E R™^", and the problem: 

maximize u^Av 

subject to ||u||2 = ||u||2 = 1 

Card(w) < fci, Card(w) < fc2, 

in the variables (w, v) £ R™ x R" where ki < m, k2 < n are fixed integers controlling 
the sparsity. This can be relaxed to: 

maximize Tr(A'^X^^) 
subject to XhO, Tr(X") = 1 

l^|X"|l<fc„ i = l,2 

i^\x''\i < Vhk-2, 

in the variable X G g™^" with blocks X^^ for i,j = 1, 2. Of course, we can consider 
several variations on this, such as constraining Card(M, v) = Card(u) + Card(w). 

3. A Robustness Interpretation. In this section, we show that problem (|2.3|l 
can be interpreted as a robust formulation of the maximum eigenvalue problem, with 
additive, componentwise uncertainty in the input matrix A. We again assume A to 
be symmetric and positive semidefinite. 

In the previous section, we considered a cardinality-constrained variational for- 
mulation of the maximum eigenvalue problem: 

maximize x'^Ax 
subject to ||a;||2 — 1 

Card(a;) < k. 

Here we look instead at a variation in which we penalize the cardinality and solve: 

fo ^\ maximize x'^ Ax — p Card (x) 

subject to ||x||2 — 1, 

in the variable x G R", where the parameter p > controls the magnitude of the 
penalty. This problem is again nonconvex and very difficult to solve. As in the last 
section, we can form the equivalent program: 

maximize Tr{AX) - pCard{X) 
subject to Tr(A) = 1 

X yO, Rank(A) = 1, 

in the variable A G S". Again, we get a relaxation of this program by forming: 

maximize Tr(AA) - pl^|A|l 

(3.2) subject to Tr(A) = 1 

A^O, 

which is a semidefinite program in the variable A G S", where p > controls the 
magnitude of the penalty. We can rewrite this problem as: 

(3.3) max min Tr(X(A + U)) 

X>:0,Tr(X) = l \Uij\<p 
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in the variables X e S" and U e S". This yields the following dual to H3.2|l: 

, , minimize \^^'^^{A + U) 

subject to \Uij\ < p, i,j = l,...,n, 

which is a maximum eigenvalue problem with variable C/ S S". This gives a natural 
robustness interpretation to the relaxation in (|3.2|) : it corresponds to a worst-case 
maximum eigenvalue computation, with componentwise bounded noise of intensity p 
imposed on the matrix coefficients. 

Let us first remark that p in (|3.2|l corresponds to the optimal Lagrange multiplier 
in lE2Il- Also, the KKT conditions (see §5.9.2]) for problem ^^ and 1(231) are 
given by: 

{{A + U)X = y^'^''{A + U)X 
UoX = p\X\ 
Tr(X) = l, X>Q 
\U.ij\ < p, i,j = l,...,n. 

If the eigenvalue X'^^'^^{A + U) is simple (when, for example, X"^'^'^{A) is simple and p is 
sufficiently small), the ffist condition means that Rank(X) = 1 and the semidefinite 
relaxation is tight, with in particular Card{X) = Card{x)^ if x is the dominant eigen- 
vector of X. When the optimal solution X is not of rank one because of degeneracy 
(i.e. when X'^'^^{A + U) has multiplicity strictly larger than one), we can truncate X 
as in llHj , retaining only the dominant eigenvector x as an approximate solution to 
the original problem. In that degenerate scenario however, the dominant eigenvector 
of X is not guaranteed to be as sparse as the matrix itself. 

4. Sparse Decomposition. Using the results obtained in the previous two sec- 
tions we obtain a sparse equivalent to the PCA decomposition. Given a matrix 
Ai G S", our objective is to decompose it in factors with target sparsity k. We 
solve the relaxed problem in (|2.3|) : 



(4.1) 



maximize Tr{AiX) 
subject to Tr(X) = 1 
l^\X\l<k 

X yo. 



Letting Xi denote the solution, we truncate Xi , retaining only the dominant (sparse) 
eigenvector xi . Finally, we deflate Ai to obtain 

A2 = Ai- (xf Aixi)xixJ , 

and iterate to obtain further components. The question is now: When do we stop the 
decomposition? 

In the PCA case, the decomposition stops naturally after Rank(A) factors have 
been found, since ^Rank(A)-i-i is then equal to zero. In the case of the sparse de- 
composition, we have no guarantee that this will happen. Of course, we can add an 
additional set of linear constraints xfXxi = to problem 1)4. l|l to explicitly enforce 
the orthogonality of xi , . . . , a:„ and the decomposition will then stop after a maxi- 
mum of n iterations. Alternatively, the robustness interpretation gives us a natural 
stopping criterion: if all the coefficients in \Ai\ are smaller than the noise level p* 
(computed in the last section) then we must stop since the matrix is essentially indis- 
tinguishable from zero. Thus, even though we have no guarantee that the algorithm 
will terminate with a zero matrix, in practice the decomposition will terminate as 
soon as the coefficients in A become indistinguishable from the noise. 



6 A. D'ASPREMONT, L. EL GHAOUI, M. I. JORDAN AND G.R.G. LANCKRIET 

5. Algorithm. For small problems, the semidefinite program H4.1f) can be solved 
efficiently using interior-point solvers such as SEDUMI |23| or SDPT3 [2S] . For larger- 
scale problems, we need to resort to other types of convex optimization algorithms 
because the 0{n^) constraints implicitly contained in 1^|X|1 < k make the memory 
requirements of Newton's method prohibitive. Of special interest are the algorithms 
recently presented in |^ El Bj ■ These are first-order methods specialized to prob- 
lems such as H3.3|l having a specific saddle-point structure. These methods have a 
significantly smaller memory cost per iteration than interior-point methods and en- 
able us to solve much larger problems. Of course, there is a price: for fixed problem 
size, the first-order methods mentioned above converge in 0(l/e) iterations, where e 
is the required accuracy on the optimal value, while interior-point methods converge 
as 0(log(l/e)). Since the problem under consideration here does not require a high 
degree of precision, this slow convergence is not a major concern. In what follows, we 
adapt the algorithm in 21 to our particular constrained eigenvalue problem. 

5.1. A Smoothing Technique. The numerical difficulties arising in large scale 
semidefinite programs stem from two distinct origins. First, there is an issue of 
memory: beyond a certain problem size n, it becomes essentially impossible to form 
and store any second order information (Hessian) on the problem, which is the key to 
the numerical efficiency of interior-point SDP solvers. Second, smoothness is an issue: 
the constraint JT ^ is not smooth, hence the number of iterations required to solve 
problem H2.3|) using first-order methods such as the bundle code of fSj (which do not 
form the Hessian) to an accuracy e is given by 0(l/e^). In general, this complexity 
bound is tight and cannot be improved without additional structural information on 
the problem. Fortunately, in our case we do have structural information available that 
can be used to bring the complexity down from 0(l/e^) to 0(l/e). Furthermore, the 
cost of each iteration is equivalent to that of computing a matrix exponential (roughly 
0{n')). 

Recently, |23 and [201 (see also ^]) proposed an efficient first-order scheme for 
convex minimization based on a smoothing argument. The main structural assump- 
tion on the function to nnnimize is that it has a saddle-function format: 

(5.1) f{x) = f{x) + max{(rx, u) - 0(u) : u G Q2} 

u 

where / is defined over a compact convex set Qi C R", f{x) is convex and differen- 
tiable and has a Lipschitz continuous gradient with constant M > 0, T is an element 
of R" " and 0(m) is a continuous convex function over some closed compact set 
Q2 C R". This assumes that the function (/)(u) and the set Q2 are simple enough so 
that the optimization subproblem in u can be solved very efficiently. When a function 
/ can be written in this particular format, |21| uses a smoothing technique to show 
that the complexity (number of iterations required to obtain a solution with absolute 
precision e) of solving: 

(5.2) nun f{x) 

x£Qi 

falls from 0(l/e^) to 0(l/e). This is done in two steps. 

Regularization. By adding a strongly convex penalty to the saddle function 
representation of / in H5.1|l . the algorithm first computes a smooth e- approximation 
of / with Lipschitz continuous gradient. This can be seen as a generalized Moreau- 
Yosida regularization step (see Jl] for example) . 
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Optimal first-order minimization. The algorithm then applies the optimal 
first-order scheme for functions with Lipschitz continuous gradient detailed in jl8| 
to the regularized function. Each iteration requires an efficient computation of the 
regularized function value and its gradient. As we will see, this can be done explicitly 
in our case, with a complexity of 0{n^) and memory requirements of 0{n^). 

5.2. Application to Sparse PCA. Given a symmetric matrix A S S", we 
consider the problem given in H3.3|l (where we can assume without loss of generality 
that p = 1): 

maximize Tr{AX) - 1^|X|1 

(5.3) subject to Tr(X) = 1 

xyo, 

in the variable X e S". Duality allows us to rewrite this in the saddle- function format: 

(5.4) min /([/), 

where 

Qi = {[/eS": \U,j\ <1, z,j = l,...,n}, Q2 = {X G S" : TrX = 1, X^Q} 

f{U) := maxxeQ,(rC/,X) - 0(X), with T = /„2, 4>{X) = -Tr{AX). 

As in ^21,, to Qi and Q2 we associate norms and so-called prox- functions. To Qi, we 
associate the Frobenius norm in S", and a prox- function defined for U ^ Qi by: 

di{U) = \u'^U. 
With this choice, the center Uq of Qi, defined as: 

C/q '■— arg min di{U), 

f7tEQi 

is C/q = 0, and satisfies di{Uo) — 0. Moreover, we have: 

Di := max dUU) ^r? 12. 
ueQi 

Furthermore, the function di is strongly convex on its domain, with convexity param- 
eter of ai = 1 with respect to the Frobenius norm. Next, for Q2 we use the dual of 
the standard matrix norm (denoted |1 • II2), and a prox- function 

d^iX) ^ Tr{X log X) + login), 

where logX refers to the matrix (and not componentwise) logarithm, obtained by 
replacing the eigenvalues of X by their logarithm. The center of the set Q2 is Xo = 
n^^In, where d2{Xo) — 0. We have 

max d2(X) < logn := Z?2- 
xeQ2 

The convexity parameter of d2 with respect to || • II2, is bounded below by (T2 = 1- 
(This non-trivial result is proved in [20|.') 
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Next we compute the (1,2) norm of the operator T introduced above, which is 
defined as: 

||T||i,2:= max (TX,C/> : \\U\\f = 1, \\X\\;^1 

A ,U 

= max IIXII2 : \\X\\f < 1 

= 1. 

To summarize, the parameters defined above are set as follows: 

Di = nV2, (71 = 1, D2= log(n), da = 1, ||T||i,2 = 1. 

Let us now explicitly formulate how the regularization and smooth minimization tech- 
niques can be applied to the variance maximization problem in H5.3|l . 

5.2.1. Regularization. The method in ^T] first sets a regularization parameter 

The method then produces an e-suboptimal optimal value and corresponding subop- 
timal solution in a number of steps not exceeding 



^^, ^ 4||r||i ,2 ID1D2 



0x02 

The non-smooth objective /(X) of the original problem is replaced with 

where j^ is the penalized function involving the prox- function d2'- 

U(U) := max(rC/,X) - 0(X) - iid2{X). 

Note that in our case, the function /^ and its gradient are readily computed; see 
below. The function /^ is a smooth uniform approximation to / everywhere on Q2, 
with maximal error fiD2 = e/2. Furthermore, /^ has a Lipschitz continuous gradient, 
with Lipschitz constant given by: 

^ ._ ^211711^2 
£2(72 

In our specific case, the function /^ can be computed explicitly as: 

/^(C/) = A*log(Trexp((A + U)/^i)) - ^l\ogn, 

which can be seen as a smooth approximation to the function f{U) — Amax(^ + U). 
This function /^ has a Lipshitz-continuous gradient and is a uniform approximation 
of the function /. 
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5.2.2. First-order minimization. An optimal gradient algorithm for mini- 
mizing convex functions with Lipschitz continuous gradients is then applied to the 
smooth convex function /^ defined above. The key difference between the minimiza- 
tion scheme developed in J18j and classical gradient minimization methods is that it is 
not a descent method but achieves a complexity of 0(L/fc^) instead of 0{l/k) for gra- 
dient descent, where k is the number of iterations and L the Lipschitz constant of the 
gradient. Furthermore, this convergence rate is provably optimal for this particular 
class of convex minimization problems (see ^| Th. 2.1.13]). Thus, by sacrificing the 
(local) properties of descent directions, we improve the (global) complexity estimate 
by an order of magnitude. 

For our problem here, once the regularization parameter /i is set, the algorithm 
proceeds as follows. 



Repeat: 

1. Compute f^{Uk) and \7f^[Uk) 

2. Find Yk - argminyes, {Vf,,{Uk),Y) + ^L\\Uk - Y\\% 

3. FindW^fc = argmiuH^es, {^%?^ + Eto ^(^(t^O + (V^((70, W - C/.))} 

4. Set L/fe+i = T^,W, + |±in 
Until gap < e. 

Step one above computes the (smooth) function value and gradient. The second 
step computes the gradient mapping, which matches the gradient step for uncon- 
strained problems (see ^1 p.86]). Step three and four update an estimate sequence 
see (^1 p. 72]) of /^ whose minimum can be computed explicitly and gives an in- 
creasingly tight upper bound on the minimum of /^ . We now present these steps in 
detail for our problem (we write U for Uk and X for Xk). 

Step 1. The most expensive step in the algorithm is the first, the computation of 
/^ and its gradient. Setting Z — A + U, the problem boils down to computing 

(5.5) u*{z) :— arg max {Z,X) — fid2{X) 

and the associated optimal value f^(U). It turns out that this problem has a very 
simple solution, requiring only an eigenvalue decomposition for Z = A + U. The 
gradient of the objective function with respect to Z is set to the maximizer u*{Z) 
itself, so the gradient with respect to U is V/^(C/) — u* {A + U) . 

To compute u*{Z), we form an eigenvalue decomposition Z = VDV^ , with D =- 
diag(d) the matrix with diagonal d, then set 

exp(^^^-f-) . ^ 



En / di — dn 



maxjj^]^ „j dj is used to mitigate problems with large numbers. We 
then let u*(z) = VHV^, with H = diag(/i). The corresponding function value is 
given by: 



UU) = ^^\og[Trcxp' ^^ + ^^ 



which can be reliably computed as: 



= ^log f X! '^^P ( — ) ) " /^log' 




fp.(.U) =dmax + Mlog >^exp( ) - filogn. 
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Step 2. This step involves a problem of the form: 

argrnin {V UiU),Y) + hwu - Y\\%, 
y t wi z 

where U is given. The above problem can be reduced to a Euclidean projection: 
(5.6) arg min ||y — "l^j|i?, 

I M 1 1 oc ^ J- 

where V = U — L^^V f ^{U) is given. The solution is given by: 

Y.j = sgn(T/ij)min(|t/„|, 1), i,j = 1, . . . ,n. 

Step 3. The third step involves solving a Euclidean projection problem similar to 
(|5.6(l . with V defined by: 

4 = 

Stopping criterion. We can stop the algorithm when the duality gap is smaller 
than e: 

gapfc - An,ax(A + Uk) -TrAXk + \^\Xk\\ < e, 

where Xk = u*{{A + Uk)/ IJ-) is our current estimate of the dual variable (the function 
u* is defined by H5.5|l ). The above gap is necessarily non-negative, since both X^ 
and Uk are feasible for the primal and dual problem, respectively. This is checked 
periodically, for example every 100 iterations. 

Complexity. Since each iteration of the algorithm requires computing a matrix 
exponential (which requires an eigenvalue decomposition and 0{n^) flops in our code), 
the predicted worst-case complexity to achieve an objective with absolute accuracy 
less than e is ^Tj : 



e V crio-2 

In some cases, this complexity estimate can be improved by using specialized algo- 
rithms for computing the matrix exponential (see |16| for a discussion). For example, 
computing only a few eigenvalues might be sufficient to obtain this exponential with 
the required precision (see [5]). In our preliminary experiments, the standard tech- 
nique using Pade approximations, implemented in packages such as Expokit (see |22|'). 
required too much scaling to be competitive with a full eigenvalue decomposition. 

6. Numerical results & Applications. In this section, we illustrate the ef- 
fectiveness of the proposed approach (called DSPCA in what follows) both on an 
artificial data set and a real-life data set. We compare with the other approaches 
mentioned in the introduction: PCA, PCA with simple thresholding, SCoTLASS and 
SPCA. The results show that our approach can achieve more sparsity in the principal 
components than SPCA (the current state-of-the-art method) does, while explaining 
as much variance. The other approaches can explain more variance, but result in prin- 
cipal components that are far from sparse. We begin by a simple example illustrating 
the link between k and the cardinality of the solution. 
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6.1. Artificial data. To compare the numerical performance with that of exist- 
ing algorithms, we consider the simulation example proposed by ,30' . In this example, 
three hidden factors are created: 

Fi-7V(0,290), ^2 -^AA(0,300), Vg = -O.SVi + 0.925^2 + e, e-7V(0,300) 

with Vi,V2 and e independent. Afterward, 10 observed variables are generated as 
follows: 

X. = V,+4, el^AA(0,l), 

with j = 1 for i = 1, . . . , 4, j = 2 for i = 5, . . . , 8 and j = 3 for i = 9, 10 and e^ 
independent for j — 1,2,3, i — 1, . . . , 10. We use the exact covariance matrix to 
compute principal components using the different approaches. 

Since the three underlying factors have roughly the same variance, and the first 
two are associated with four variables while the last one is associated with only two 
variables, Vi and V2 are almost equally important, and they are both significantly 
more important than V3. This, together with the fact that the first two principal 
components explain more than 99% of the total variance, suggests that considering two 
sparse linear combinations of the original variables should be sufficient to explain most 
of the variance in data sampled from this model j30j . The ideal solution would thus 
be to use only the variables (Ai, A2, A3, A4) for the first sparse principal component, 
to recover the factor Vi, and only (A5, Ag, Ay, Ag) for the second sparse principal 
component to recover V2. 

Using the true covariance matrix and the oracle knowledge that the ideal spar- 
sity is four, [30! performed SPCA (with A = 0). We carry out our algorithm with 
fc = 4. The results are reported in Table IHTI together with results for PCA, simple 
thresholding and SCoTLASS (i = 2). Notice that DSPCA, SPCA and SCoTLASS all 
find the correct sparse principal components, while simple thresholding yields inferior 
performance. The latter wrongly includes the variables Ag and Aio (likely being mis- 
led by the high correlation between V2 and V3), moreover, it assigns higher loadings 
to Ag and Aio than to each of the variables (A5, Ag, A7, Ag) that are clearly more 
important. Simple thresholding correctly identifies the second sparse principal com- 
ponent, probably because Vi has a lower correlation with V3. Simple thresholding 
also explains a bit less variance than the other methods. 

Table 6.1 
Loadings and explained variance for the first two principal components of the artificial exam- 
ple. Here, "ST" denotes the simple thresholding method, "other" is all the other methods: SPCA, 
DSPCA and SCoTLASS. PCI and PC2 denote the first and second principal components. 





Xi 


X2 


X3 


Xi 


Xs 


Xe 


X'r 


Xs 


X9 


^10 


explained varianee 


PCA, PCI 
PCA, PC2 


.116 

-.478 


.116 

-.478 


.116 

-.478 


.116 

-.478 


-.395 

-.145 


-.395 
-.145 


-.395 
-.145 


-.395 

-.145 


-.401 
.010 


-.401 
.010 




60.0% 
39.6% 


ST, PCI 
ST, PC2 



-.5 



-.5 



-.5 




-.5 










-.497 



-.497 



-.503 



-.503 





38.8% 
38.6% 


other, PCI 
other, PC2 



.5 



.5 




.5 




.5 


.5 



.5 



.5 



.5 













40.9% 
39.5% 



6.2. Pit props data. The pit props data (consisting of 180 observations and 13 
measured variables) was introduced by [2j and is another benchmark example used to 
test sparse PCA codes. Both SCoTLASS [TT] and SPCA ^ have been tested on this 
data set. As reported in [30], SPCA performs better than SCoTLASS in the sense 
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that it identifies principal components with 7, 4, 4, 1, 1, and 1 non-zero loadings, 
respectively, as shown in Table 15.21 This is much sparser than the modified principal 
components by SCoTLASS, while explaining nearly the same variance (75.8% versus 
78.2% for the 6 first principal components) jJOl- Also, simple thresholding of PCA, 
with a number of non-zero loadings that matches the result of SPCA, does worse than 
SPCA in terms of explained variance. 

Following this previous work, we also consider the first 6 principal components. 
We try to identify principal components that are sparser than those of SPCA, but 
explain the same variance. Therefore, we choose values for fc of 5, 2, 2, 1, 1, 1 (two 
less than the values of SPCA reported above, but no less than 1). Figure |0] shows 
the cumulative number of non-zero loadings and the cumulative explained variance 
(measuring the variance in the subspace spanned by the first i eigenvectors). It 
can be seen that our approach is able to explain nearly the same variance as the 
SPCA method, while clearly reducing the number of non-zero loadings for the first six 
principal components. Adjusting the first value of k from 5 to 6 (relaxing the sparsity), 
we obtain results that are still better in terms of sparsity, but with a cumulative 
explained variance that is uniformly larger than SPCA. Moreover, as in the SPCA 
approach, the important variables associated with the six principal components do 
not overlap, which leads to a clearer interpretation. Table W!^ shows the first three 
corresponding principal components for the different approaches (DSPCAw5 denotes 
runs with ki — 5 and DSPCAwG uses fci =6). 

Table 6.2 
Loadings for first three principal components, for the pit props data. DSPCAwS (resp. DSP- 
CAw6) shows the results for our technique with ki equal to 5 (resp. 6). 





topd 


length 


moist 


tcstsg 


ovcnsg 


ringt 


ringb 


bowm 


bowd 


whorls 


clear 


knots 


diaknot 


SPCA 1 
SPCA 2 
SPCA 3 


-.477 




-.476 






.785 





.620 




.177 



.640 






.589 


-.250 



.492 


-.344 

-.021 




-.416 




-.400 











.013 








-.015 


DSPCAw5 1 
DSPCAw5 2 
DSPCAw5 3 


-.560 




-.583 






.707 






.707 













-.793 


-.263 



-.610 


-.099 




-.371 




-.362 


















.012 


DSPCAw6 1 
DSPCAw6 2 
DSPCAw6 3 


-.491 




-.507 






.707 





.707 









-.067 



-.873 


-.357 


-.484 


-.234 




-.387 




-.409 


















.057 



6.3. Controlling sparsity with k. We present a simple example to illustrate 
how the sparsity of the solution to our relaxation evolves as k varies from 1 to n. We 
generate a 10 x 10 matrix U with uniformly distributed coefficients in [0, 1]. We let v 
be a sparse vector with: 

^=(1,0,1,0,1,0,1,0,1,0). 

We then form a test matrix A = U^U + avv'^ , where tr is a signal-to-noise ratio that 
we set equal to 15. We sample 50 different matrices A using this technique. For each 
value of k between 1 and 10 and each A, we solve the following SDP: 

max Tr{AX) 

subject to Tr(A:) = 1 

1^|X|1 <fc 

X hO. 

We then extract the first eigenvector of the solution X and record its cardinality. In 
Figure 16.21 we show the mean cardinality (and standard deviation) as a function of 
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a 
-5 



3 
o 





Number of principal components 



Number of principal components 



Fig. 6.1. Cumulative cardinality and percentage of total variance explained versus number of 
principal components, for SPCA and DSPCA on the pit props data. The dashed lines are SPCA 
and the solid ones are DSPCA with fci = 5 and k\ = 6. On the right, the dotted line also shows 
the percentage of variance explained by standard (non sparse) PCA. While explaining the same 
cumulative variance, our method (DSPCA) produces sparser factors. 



k. We observe that fc + 1 is actually a good predictor of the cardinality, especially 
when A: + 1 is close to the actual cardinality (5 in this case). In fact, in the random 
examples tested here, we always recover the original cardinality of 5 when fc + 1 is set 
to 5. 



>, 



o3 



o3 







k Problem size n 

Fig. 6.2. Left: Plot of average cardinality (and its standard deviation) versus k for 100 random 
examples with original cardinality 5. Right: Plot of CPU time (in seconds) versus problem size for 
randomly chosen problems. 



6.4. Computing Time versus Problem Size. In Figure loTzl we plot the total 
CPU time used for randomly chosen problems of size n ranging from 100 to 800. The 
required precision was set to e = 10""^, which was always reached in fewer than 60000 
iterations. In these examples, the empirical complexity appears to grow as 0(n'^). 

6.5. Sparse PCA for Gene Expression Data Analysis. We are given m 
data vectors Xj G R'\ with n = 500. Each coefficient Xij corresponds to the ex- 
pression of gene i in experiment j. For each vector Xj we are also given a class 



Cj e {0,1,2,3}. We form A = 



the covariance matrix of the experiment. Our 



objective is to use PCA to first reduce the dimensionality of the problem and then 
look for clustering when the data are represented in the basis formed by the first three 



14 



A. D'ASPREMONT, L. EL GHAOUI, M. I. JORDAN AND G.R.G. LANCKRIET 



principal components. Here, we do not apply any clustering algorithm to the data 
points, we just assign a color to each sample point in the three dimensional scatter 
plot, based on known experimental data. 

The sparsity of the factors in sparse PCA implies that the clustering can be 
attributed to fewer genes, making interpretation easier. In Figure lFTSl we see clustering 
in the PCA representation of the data and in the DSPCA representation. Although 
there is a slight drop in the resolution of the clusters for DSPCA, the key feature 
here is that the total number of nonzero gene coefficients in the DSPCA factors is 
equal to 14 while standard PCA produces three dense factors, each with 500 nonzero 
coefficients. 



PCA 




Sparse PCA 







Fig. 6.3. Clustering of the gene expression data in the PCA versus sparse PCA basis with 
500 genes. The factors f on the left are dense and each use all 500 genes while the sparse factors 
31, 92 and gz on the right involve 6, 4 '^'"■d, 4 genes respectively. (Data: Iconix Pharmaceuticals) 
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